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MORGAN A. BISHOP, 1 ARKADII G. D’YACHKOV, 2 ANTHONY J. MACULA, 3 
THOMAS E. RENZ, 4 and VYACHESLAV V. RYKOV 5 


ABSTRACT 

DNA nanotechnology often requires collections of oligonucleotides called “DNA free energy 
gap codes” that do not produce erroneous crosshybridizations in a competitive muliplexing 
environment. This paper addresses the question of how to design these codes to accomplish 
a desired amount of work within an acceptable error rate. Using a statistical thermody¬ 
namic and probabilistic model of DNA code fidelity and mathematical random coding theory 
methods, theoretical lower bounds on the size of DNA codes are given. More importantly, 
DNA code design parameters (e.g., strand number, strand length and sequence composition) 
needed to achieve experimental goals are identified. 

Key words: biomolecular computing, crosshybridization, DNA barcodes, DNA codes, DNA 
computing, DNA words, hybridization, nearest neighbor, random coding methods, self assembly, 
stacked pairs, statistical thermodynamics, SynDCode, tag-antitag systems, universal arrays. 

1. INTRODUCTION 

D NA nanotechnology often requires collections of oligonucleotides that do not produce erroneous 
crosshybridizations. When these collections consist of complementary pairs of oligonucleotides, i.e., 
are closed under complementation, they are called DNA tag-antitag systems (Kaderali et al., 2003) and DNA 
codes (D’yachkov et al., 2003, 2005b, 2006). When the collections need not be closed under complemen¬ 
tation they are called DNA words (Andronescu et al., 2003; Tulpan et al., 2005; Shortreed et al., 2005) and 
DNA barcodes (Eason et al., 2004). These collections of non-crosshybridizing collections have applications 
in SNP multiplexing (Cai et al., 2000; Kaderali et al., 2003; Fish et al., 2007), gene function identification 
(Eason et al., 2004), nanostructure self-assembly (Valignat et al., 2005), universal microarrays (Hardenbol 
et al., 2003), and biomolecular computing (Braich et al., 2002; Rose et al., 2004). Combinatorial, heuristic, 
and biological methods have been suggested as a means by which DNA codes can be found and programs 
exist that generate DNA codes (Andronescu et al., 2003; Bishop et al., 2006; Chen et al., 2006; Tulpan 
et al., 2005; D’yachkov et al., 2006; Penchovsky and Ackermann, 2003). 
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In several papers (Zhang et al., 2005; Tulpan et al., 2005; Dirks et al., 2004,2007; Rose et al., 1999,2004; 
Horne et al., 2006), statistical thermodynamics is applied to model competitive multiplexing hybridization. 
However, the methods there are primarily numerical in nature and do not provide detailed information 
about how to design collections of non-crosshybridizing strands to accomplish a desired amount of work 
within an acceptable error rate. This paper concerns exactly this question and presents a theoretical, not 
a heuristic, numerical or algorithmic, way to decide this question. Using a statistical thermodynamic and 
probabilistic model of DNA code fidelity coupled with mathematical random coding theory methods similar 
to those presented in D’yachkov et al. (2005b, 2003), theoretical lower bounds on the size of DNA codes 
are given. More importantly, DNA code design parameters, e.g., strand number, strand length and sequence 
composition, needed to achieve experimental self-assembly goals are identified. 

Single strands of DNA are represented by (A, C, G, T) -quaternary sequences that are oriented, either 
5' —> 3' or 3' —> 5'. In this paper, single stranded DNA molecules without an indicated direction are 
assumed to be in the 5' —*■ 3' direction. The reverse-complement of a DNA strand is defined by first 
reversing the order of the letters and then substituting each letter with its complement, A for T, C for G 
and vice-versa. For example, the reverse complement of AACGTG is CACGTT. Henceforth, complement 
means reverse-complement unless otherwise stated. For strand x, let x denote its complement. A (perfect) 
Watson-Crick duplex is the joining of complement sequences in opposite orientations so that every base 
of one strand is paired with its complementary base on the other strand in the double helix structure, 
i.e., x and x are “perfectly compatible.” However, when two, not necessarily complementary, oppositely 
directed DNA strands are “sufficiently compatible,” they too are capable of coalescing into a double 
stranded DNA duplex. The process of forming DNA duplexes from single strands is referred to as DNA 
hybridization. Crosshybridization is when two oppositely directed and non-complementary DNA strands 
form a duplex. Crosshybridization doesn’t always occur, but there is a potential for it to happen. In general, 
crosshybridization is undesirable as it usually leads to experimental error. To increase the accuracy and 
throughput of the applications listed above, there is a desire to have collections of DNA strands, as large 
and as mutually incompatible as possible, so that no crosshybridization can take place. 

Definition 1. Given two DNA strands x and y, we let x : y denote the DNA duplex formed between x 
and y. It is implicitly assumed that x and y are oppositely oriented in x : y with the first strand x always 
assumed to be in the 5' —> 3' direction and the second y always assumed to be in the 3' —> 5' direction. 
A crosshybridized (CH) duplex is an x : y, where y f x. 

Even though it is possible for complementary sequences to form a non-perfectly aligned duplex, we 
will call any % : x duplex a Watson-Crick (WC) duplex. Two oppositely directed copies of a single strand 
% can form x : x, which is a CH duplex if x is not self-complementary, e.g., x = ACGT = x. In the 
discussion below, self-complementary strands are largely forbidden. 


2. STACKED PAIRS AND UNSTACKED 2-STRINGS IN 
SECONDARY STRUCTURES 

Let x = xi,... ,x n and y = y\,.... y n be DNA sequences. For a base >y, let yj be its complement 
base. Then y = y n ,... ,y\. 

Definition 2. Suppose 1 ^ i r , j, f n. A secondary structure of the DNA duplex x : y is a sequence of 
pairs of complementary bases p = (xy, y n+ \-j r ) where xy = y „+1 - y and (xy) and (y M+ i_y) are increas¬ 
ing and decreasing subsequences ofx and y respectively. Given a secondary structure p = (xy , y„ + i_y) 
a stacked pair in a duplex is a pair of consecutively aligned complementary bases, xy = y n +i-_y, 
Xy + , = v«+i-y +P in p where i r +t = i r + 1 and y,-+i = y r + 1 The notation Xy x, r+1 / y n +1 -j r y n + 1 -j r +\ 
is used to denote a stacked pair. An unstacked 2-string ofx in a secondary structure p = (xy, y„ + i_y) is 
a 2-string, x/x,-+i. of x that is not part of any stacked pair in p. 

Clearly the x : y can have many secondary structures and stacked pairs and unstacked 2-strings must 
be defined relative to a given secondary structure. 
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FIG. 1. An example of a secondary structure in a DNA duplex. 


Example 1. The secondary structure in Figure 1 has stacked pairs (ED 

A^Gs/TuC\q, GsT 6 /C\oAg, T 6 T 7 /AgA&, TgA^/A 7 T(„ C11C12/G3G2 

where the subscripts indicate the position of the bases in the 5' —> 3' direction. Since 
A4G5, G5T 5, T 6 T 7 , TgA W , C11C12 

are the 5' —> 3' bases in stacked pairs in the exhibited secondary structure, then the unstacked 2-strings 


C1C2, C2C3, C3A4, T 7 T 8 , T 8 T 9 , AioCn, C12C13, C13C14. 

Definition 3. Given two DNA strands x and y, let S(x, y) and (J(x, y) respectively denote the maxi¬ 
mum and minimum number of stacked pairs and unstacked 2-strings over all secondary structure between 
x and y and in x. 

A dynamic programming method to compute Six, y) is given in D’yachkov et al. (2005a, 2006). It is 
implemented in DNA code software SynDCode which is available at Bishop et al. (2006). 

Definition 4. Let L be a collection of 2-strings of DNA bases closed under complementation, e.g., 

L = {AA, TT,AT, TA} or {AT}. A DNA sequence x = x\,.... x n is called an L sequence of length n if 
Xi Xi +1 ^ L for each 1 C t ^ n — 1. Let DNA(n, L) denote the set L sequences of length n. The cardinality 
of DNA(n, L) is denoted by A n ,L or just A„ when the context is clear. 

Throughout this paper, k,n,s,u and N denote positive integers and, whenever x e DNA(n,L ) is 
selected, it is assumed that each such x is equally likely. 

Proposition 1. Let L = {AA, TT, AT, TA}, then |ins-A: begin | 

A„,L = (i - Vs) + 5 (l + Vs) (2-D 

and 

l,,L< 4(1 +A)" -1 . (2.2) 

Proof. A sequence in DNA{n, L) can have at most |"|] of the letters A or T and no two of these can 
be consecutive. So given x e DNA(n, L), suppose the number of letters A or T it contains is k where 
0 fk f |~ |]. There are 2 k different ways to arrange these. Then between any two of the letters A or T at 
least one G or C must be inserted. By a classical combinatorial “objects in boxes” type argument, there 


n-*-(*-l) + (fr+l)- 
n-k-(k- 1) 


i-r.’i' 


are 



FREE ENERGY GAP AND DNA CODES 


1091 


ways to insert the n—k letters G or C. Thus 


It is known that 



where F(n) is recursively defined by F(n) = F(n — 1) + F(n — 2) and F(\) = 2 and F(2) = 3. By 
the solving the Fibonacci recurrence relation with the given initial conditions F(n) = 5 ~ | 3 ( ^ ( 1 j + 
5+^1 . From this, (2.1) follows. Since 

= 4(l + V5)' ! 


(2.2) is established. ■ 

Definition 5. For x,y,z e DNA(n, L) with x ^ y, let 

Bn.i.Ax) = { y ■ S(x,y ) > s}. 

A n ,L, s = {Z ; S(Z, z) > 4. 

Proposition 2. Let L = {AA, TT,AT, TA} and x,y,z e DNA(n, L). 

a. iiw*)i< E ( s A)( n 7) 2 * n ~ a ~ j ■ 


b. |B„,l, s (x)U 




min(i,n-i) * 

Vn.eM E ([ipCT) 4 ""^' 

s+j even 

\A,u .,\< “E” ( L |ji,)(7') min 4^1 (l + . 
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Proof. Proposition 2a appears as Proposition 12 in (D’yachkov et al., 2006) for L = 0. A relatively 
straightforward modification of the proof there yields Proposition 2b here. Also see (D’yachkov et al., 
2005b) where a proof of Proposition 2c appears. 

To show Proposition 2b here, let X , Y be binary sequences of length n over {0,1} and {0,2} respectively. 
There are ( ,! ~ l | ) (” _,s ) 2 pairs X and Y such that each have s + j 0s with the 0s partitioned into j substrings 
so that each substring has at least two O.v and these 0s are partitioned in exactly the same way in X and 
Y. For example, X = 000, 1,00,11,0000 and Y = 2, 000, 00, 2,0000, 2 where the common partition of 
the 0s is 000,00,0000. 

Fix x € DNA(n , L ) and suppose y e DNA(n, L) has S(x, y) ^ s. By the arguments in (D’yachkov et al., 
2005a) and (D’yachkov et al., 2005b), for some 1 < j < min(s, n — s) there is a common subsequence 
( x !/t) = (Yjk ) °f length .y + j of x and y respectively that can be obtained from a pair X , Y above by 
taking ( xi k ), (yj k ) to be the subsequences in x and y that correspond to the positions of the 0s is in X, Y 
respectively. 

Since x is fixed, place y in class X, Y if the required common subsequence was obtained from this pair. 
The number of different v in a given class is the number of {A, C, G, T) sequences that can arise from 
aye DNA(n , L ) when it is restricted to the positions of the 2s in Y. The number of such subsequences 
that can arise is at most 


HuL ■ A b 2 ,L ■ • •• ■ Aj y+1 ,L 


where 


j+1 

y ^bj = n — s - j for b t > 0 and Aq ,l = 1. 


Am, 


• A b2 ,L ■ ••• • A bj+u L < nun ^4""^, 4 j+1 (l + Vs)" ^ 


and Proposition 2b then follows. 

To show Proposition 2d here, let Z be a binary sequence of length n over {0,1}. There are (r -^ )(' i-,s ) 

kh 1 

such Z that have s + j 0s with s+j even and with the 0s symmetrically partitioned into j substrings 
so that each substring has at least two (F and the first s -^j- 0s are arranged as the mirror image of the last 
s -^- (F . For example, e.g., Z = 11,000,1,000,00,111,000,000,1 where the symmetric partition of the 
0 s is 000,000,0|0,000,000. 

Let z e DNA(n, L) and suppose S(z, z) ^ s. By the arguments in (D’yachkov et al., 2005b), for some 
1 < j < minfy, n—s), there is a Z as described above such that the positions of the 0s in Z correspond to 
the positions of a self-complement subsequence of length s + j in z. Since a self-complement subsequence 
of length s + j is determined by its first Lti entries, it follows that there are most 


(i + vsjHW'+rfMj 

= min (l + 


z in each class Z because there are at j + 1 blocks of Is and |"^"| blocks of 0s for which there is choice 
to place substrings of DNA(bi,L ) to construct sequences that capture all possible z in class Z. ■ 


The following Corollary 1 is now trivial. 
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Corollary 1. Let 0 ^ s ^ n — 1. Suppose L = \AA. TT,AT, TA}. Select x, y e DNA{n, L) with x ^ y. 

Then 

a. Pt(5(jc, y) f s) f 7- ^ ('jl\)( n ~ s ) 2 min , 4 /+l (l + 77) ' j . 

i. Pf(S(*, *)>*)«£ J2 ( r ^,)(7> in ( 4 ^. 4 r^l (l+Vs)' ; ' J 

s+j even 

Proof. Part a follows from Proposition 2b. Part b follows from Proposition 2d. ■ |ins-A: end| 

Definition 6. Let U(x, y) denote the minimum number of unstacked 2-strings in x over all secondary 
structures between x and y. 


If x and y have a maximum number of s stacked pairs over all secondary structures, then there must 
be a minimum number of n — s — 1 unstacked 2-strings among the X1X2 ,x n -\x n that are not stacked. 
Thus, U(x, y) < u if and only if S(x, y) f n — u — ]. So the next Corollary 2 follows from Corollary 1. 


Corollary 2. Let 0 f u f n—]. Suppose L = {AA, TT,AT, TA}. Select x, y e DNA(n, L) with x ^ y. |ins-B: begin| 
Then 


a. Pr (1 /(jc, y) < u) < Ffu, n) (2.3) 

where 

min(w+l,n—w—1) . 2 i\ 

F 1 (u,n)=— (”7-7 2 )C7) 2 ^ ( 4 “ +W ,4 J + I (l + 7s) J . 

b. Pr (U(x,x) < u) < F 2 (u,n) (2.4) 


where 


' xf (fjpcr)-*■ (4*4^, 4 m (1+. 


|INS-B: END | 


3. BOUNDS ON NEAREST NEIGHBOR THERMODYNAMICS 

Stacked pairs play a special role in the Nearest Neighbor (AW) model of DNA duplex thermodynam¬ 
ics (SantaLucia, 1998; Zuker et al., 1999). Briefly, local thermodynamic functions AH, AS, which are 
essentially independent of temperature T, are experimentally found for stacked pairs and other secondary 
structure motifs and then are used in an additive fashion to predict global thermodynamic values for 
duplexes. Free energy, AG, at a given temperature T is derived from AH, AS by 

AG = AH - TAS (3.1) 

One example of these local functions for stacked pairs is given in Table 1 (SantaLucia, 1998). A (H) 
demonstration of how these local functions are used to make global predictions is given in Example 2. 
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Table 1. Nearest Neighbor Thermodynamic Values 
for Stacked Pairs 


Stacked pair 

5' 3'/3' 5' 

AH kcal/mol 

AS caU°Kmol 

AG3 iqo k kcal/mol 

AA/TT=TT/AA 

-7.9 

-22.2 

-1.02 

AC/TG=GT/CA 

-8.4 

-22.4 

-1.46 

AG/TC=CT/GA 

-7.8 

-21.0 

-1.29 

AT/TA 

-7.2 

-20.4 

-0.88 

CA/GT=TG/AC 

-8.5 

-22.7 

-1.46 

CC/GG=GG/CC 

-8.0 

-19.9 

-1.83 

CG/GC 

-10.6 

-27.2 

-2.17 

GA/CT=TC/AG 

-8.2 

-22.2 

-1.32 

GC/CG 

-9.8 

-24.4 

-2.24 

TA/AT 

-7.2 

-21.3 

-0.60 


Example 2. The AH, AS of the WC duplex x : x = AGTCA.TCAGT predicted by the NN model is 
computed by essentially summing associated Table 1 values for the duplex’s stacked pairs, 

AG/TC, GT/CA, TC/AG, CA/GT, 

then adding a constant initiation penalty IP. Thus for AGTCA-TCAGA, 

AH (duplex) = -7.8 - 8.4 - 8.2 - 8.5 + IP X = -32.9 + IP X kcal/mol 

AS (duplex) = -0.0210 - 0.0224 - 0.0222 - 0.0227 + IP 2 = -0.0883 + IP 2 kcal/°Kmol. 

From these computed values, the free energy for the WC duplex is given by (3.1) and thus for the duplex 
at 310° K, 


AG (duplex) = -5.51 + I Pi + IP 2 kcal/mol. 

Another way to accomplish the same task is to first compute the AG for each stacked pair at a given 
temperature, sum these values and add IP = IP\ + IP 2 . The AG for stacked pairs at 310°^ is given in 
the last column of Table 1. 

Predictions about the thermodynamic stability of CH duplexes with a given secondary structure can 
also be made from the stacked pairs that it contains. In D’yachkov et al. (2005a, 2006), it is argued that 
the AG for a CH duplex is bounded below the sum of all the free energies of the stacked pairs that it 
contains plus IP. Note, the more negative AG means more stable. The claim is strongly supported by 
comparing this thermodynamically weighted stacked pair sum computed by SynDCode (Bishop et al., 
2006) to computations made by NN minimum free energy software Pairfold (Andronescu et al., 2003) and 
RNAstructure (Mathews et al., 2006). 

Example 3. The secondary structure in Figure 1 has stacked pairs 

A4G5/T11C10, G 5 T 6 /CioA 9 , T^Tt/AvA?,, T^Aiq/A-jT^, CuC\ 2 /G 3 G 2 . 

Hence at 310°K and using values from Table 1, the AG of the duplex with the indicated secondary structure 
is bounded below by 


-1.29 - 1.46 - 1.02 - 1.60 - 1.83 = -7.20 + IP kcal/mol. 

Thus using the standard IP = 1.96, SynDCode (Bishop et al., 2006) gives AG(duplex) = —6.24 kcal/mol. 
Pairfold (Andronescu et al., 2003; Mathews et al., 2006) respectively give —2.48 and —2.70 kcal/mol. 
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4. RELATIVE STABILITY AND PROBABILITY VIA 
STATISTICAL THERMODYNAMICS 

One question is: How does one measure relative stability of DNA duplexes? For example, given only N 

possible secondary p\,..., p# structures for a duplex with free energy AGi.AG,v respectively where 

each AG, K 0, how likely is the duplex to have p,? Thinking of the secondary structures as states, a 
statistical thermodynamic partition function argument can be applied. Let <w, be some real-valued function 
of AG, . Then the partition function Q is given by: 


N 

2 = Z>- 


(4.1) 


Using Q, the probability that the duplex has secondary structure p, is given by 

Pr(Pi) = (4.2) 

Typically, cu,- = exp(| AG,- \/RT) where R = 0.0019872 kcal/°Kmol is the gas constant and T is temper¬ 
ature in degrees Kelvin. Following this thread: 

-i-, (4 , 3) 

f>p(|AG,|/*r) 1 + E ex P*-(|AGil - IAG,l)/«r) 

If for all i + 1, | AGi | - | AG,-1 ^ g ^ 0. Then 

Pr(pi) > , , TTT - —7 -777TT' (4 ' 4) 

1 + (N - l)exp {-g/RT) 

This leads to the use of the minimum free energy gap as a measure of performance. This is explored in 
the next section (Tulpan et al., 2005; Penchovsky and Ackermann, 2003; Shortreed et al., 2005). 


5. STATISTICAL THERMODYNAMIC SIGNIFICANCE 
OF THE FREE ENERGY GAP 

Since each possible stacked pair can be identified with its 2-string in the 5' —*■ 3' strand (usually x) in 
the duplex, the absolute values of thermodynamic parameters in Table 1 are a function of 2-strings of bases. 
These positive functions are given in Table 2 where AH, AS are renamed as /# and fs respectively. 


Table 2. Positive Thermodynamic 
Functions on Two Strings 


Two-string 

5' ^ 3' 

fH 

fs 

fG,310 

AA=TT 

7.9 

22.2 

1.02 

AC=GT 

8.4 

22.4 

1.46 

% 

II 

H 

7.8 

21.0 

1.29 

AT 

7.2 

20.4 

0.88 

s 

II 

< 

O 

8.5 

22.7 

1.46 

CC=GG 

8.0 

19.9 

1.83 

CG 

10.6 

27.2 

2.17 

GA=TC 

8.2 

22.2 

1.32 

GC 

9.8 

24.4 

2.24 

TA 

7.2 

21.3 

0.60 


(T2) 
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The associated absolute value of the free energy for stacked pairs, / g,t , at a given temperature T can 
be computed by using the following version of the classical equation (3.1): 

fc,T = T ■ f s - fH■ (5.1) 

For example, /g, 3 io is given in the last column of Table 2. Note that each of the functions /#, fs and 
fa r have the property that /(x, x, + i ) = /(v, + i x ( ), i.e., they are invariant under complementation. 

Henceforth fH , fs will be arbitrary positive functions on 2-strings invariant under complementation 
and fa,T will be assume to have been derived from such via (5.1). 

Definition 7. Given distinct and non-self-complementary strands x, y in DNA(n, L), consider the WC 
duplex x : x and the CH duplexes, x : x, x : y. Let ||x, x|| denote the absolute value of the AG of the 
WC duplex x : x in perfect alignment and let |j x, x(| and \\x, y || denote the absolute values of the AG of 
most stable secondary structures for the x : x, x : y CH duplexes The quantity 

\\x,x\\-\\x,y\\ 


is plainly referred to as the asymmetric free energy gap. 

Given distinct and non-self-complementary strands x\, X2, ..., Xjq in DNA(n,L), think of strand x\ 
having N + 1 possible states. It can either form a WC duplex with its complement X\ or it can form a 
CH duplex with one of the other N strands X\,X2 , ■ ■ ■ ,xn (itself included as there may be multiple copies 
of each strand.) Let x = x\. Then following the partition function argument given in (4.3) and (4.4), the 
probability that x forms a WC duplex is 


Pr(x : x) = 


exp(||.v : x\\/RT) 

N 

exp(||A- :x\\/RT) + exp(||x : x\\/RT) + J>xp(||x : x^/RT) 


(5.2) 


Now suppose that for i f 1, ||x, x|| — ||x, x|| and ||x, x|| — ||x, x,|| are all at least g ^ 0. Then by (4.4) 
and (5.2), 

*<* ; *>» l + ^ + olxpH./STy < 5 ' 3 » 

Suppose C = {{xi,Xi}}^ = j is a collection of N complementary pairs of strands in DNA(n, L) of multi¬ 
plicity 2, such that the 2 N strand types are distinct and thus no strand type is self-complementary. Then 
there are a total of AN strands in solution. Suppose further that ||x,-, x,-1| — ||x,-,y|| ^ g ^ 0 for all y = xj 
or y = Xj where j f i. Then for any strand xp. 

Pr ( Xi : Xi) = Pr(*,- : x t ) > * - — (5.4) 

1 + (2 N - l)exp (-g/RT) 

Under the reasonable assumption that the formation of the x, : Xj duplex is independent (or doesn’t reduce 
the probability) of the formation of the Xj : Xj duplex, then: 


Given the entire collection C of AN strands, the probability that 2 N WC duplexes form so that there are 
no CH duplexes is at least 

( 1 \ 2N 

V1 + (IN — 1) exp(—g/ RT)) ' 

If (5.5) is to be at least a where 0 < a < 1, then solving for g in terms of a and N yields 


g = 


-RT In 


fa - 1 / 2iV _n 

V 2N-1 ) 


(5.6) 
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Definition 8. Given faj and any complementary pair x : x, define 


\\x,x\\fc.T 


fc,.r(xiXi+\). 


(5.7) 


Suppose x and y are non-complementary strands. Given a secondary structure p = (xj r , y n +\-j r ) between 
x and y, let A(p ) be the stacked pairs in p. Define 

II*. y\\fG,T,„ = fc,r(x Jr x jrU ). (5.8) 

x jr x jr+l eA(p) 

\\x, y\\f G , T = max(||x, j||/ G , r ,p). (5.9) 

yfG,r(x,y ) = \\x,x\\ f GT - \\x, y\\f GT (5.10) 

Vf GT (x,y) is called the asymmetric stacked pair free energy gap. y is written for y/ GT when the context 
is clear. 

Proposition 3. Let x, y € DNA(n, L). Suppose 0 < c < faj for all x,-jc,+i f L. Then yf GT (x, y) > 
U(x, y) ■c 

Proof. Let p = (xj r , y„+\-j r ) be a a secondary structure between x and y, let B x (p) be the unstacked 
pairs in x relative to p. From (5.7)—(5.11), it follows that 

Yf G ,T(x,y)= X! fG.T(x kr x kr+l ) 

x k r X kr + l €B x(p) 

for some p. Since for every p , B x (p)\ > U(x,y), the result follows. ■ 

Note y(x, x) = 0 and 


\\x,y\\f G . T = \\y,x\\ fGJ 


\\x,y\\f G , T = \\x,y\\ f G.T 


y(x,y) = y(x,y). 


However, since 


\\x,x\\f G .T + IIJ- y\\fa.T 


then 


y(x,y) + y(y,x) 

and this is why the term “asymmetric” is used. 

In D’yachkov et al. (2005a, 2006), it is discussed how ||x,y||/ GJ . — IP is an upper bound on ||x, y| 
when fn, fs are as given in Table 2. The NN model gives that ||x, x|| = ||x, x|| f CT — IP, so it follows 
that 


y(x,y)< \\x,x\\-\\x,y\\. 


(5.11) 


Thus the asymmetric stacked pair free energy gap is a lower bound for the asymmetric free energy gap. 
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Definition 9 is central to this paper. 

Definition 9. Let fcj be as given in equation (5.1) for some fn, fs • A DNAf GT (n, L,g) code C is 
a collection of complementary pairs in DNA(n, L ) such that no strand is self-complementary and for any 
two non-complementary strands x, y in C: 


YfG.r(x,y) > g. 


DNA f GT ( n,L,g) codes are also called free energy gap DNA codes. 

The DNA code software SynDCode (Bishop et at., 2006) generates DNAf GT (n, L, g) codes C with 
many additional and optional user added sequence constraints (see Example 7 and Conclusion). It should 
also be noted that for x, y e C that since y(x, y) > g and y(y, x) > g that 

min(||x, x\\, \\y,y\\ - \\x, y\\) > g. (5.12) 

Thus DNAf G T (n, L, g) codes are nearly of the type discussed in Tulpan et al. (2005), Penchovsky and 
Ackermann (2003), and Shortreed et al. (2005), and in which (5.12) is nearly one of the of main constraints 
(see Conclusion). 


6. RANDOM CODING BOUND FOR HIGH FIDELITY 

DNA fG T ( n,L,g ) CODES 

Equations (5.6) and (5.11) provide a relationship between the free energy gap and the probability of 
correct self-assembly. 

Definition 10. Suppose only the strands of a DNAf GT ( n,L,g ) code C are present in solution in equal 
concentrations. For 0 < a < 1, then C self-assembles with fidelity a if a is the probability that every 
strand in C forms a WC duplex. 

If the model of having exactly two copies of the strands of a DNAf GT (n, L, g) code C is assumed to be 
reasonable for strands in equal concentrations, then from equation (5.11), a DNAf GT ( n,L,g ) code C with 
N pairs has fidelity a if, for any two non-complementary strands x, y in C, the asymmetric free energy 
gap y(x, y) satisfies: 


y(x,y)>-RT\n\—^— r \=g. (6.1) 

Below a random coding theory method is used to get a lower bound on the number of complementary 
pairs A of a DNAj GT (n,L,g) code. 

Example 4. If C is a DNAf GT (n, L,g ) with N pairs of multiplicity 2 where g is given by (6.1) when 
a = 1 — 2F> then out of the desired 2N WC duplexes, an expected number of one WC duplex does not 
form. In Figure 2, a sufficient stacked pair free energy gap g(a, N) with a = 1 — ^ is plotted against 
logio(N). 

Definition 11. Let U = {{x, x) : {x, x} e DNA(n, L)}. An E c|( is called a random DNA(n, L) k-set 
of pairs if\E\ = k and the uniform distribution is on the k-sets ofU. 

For the remainder of this paper E is assumed to be a random DNA(n, L ) k-set of pairs. 
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stacked pair free energy gap 

FIG. 2. Stacked pair free energy gap versus code size. 


Definition 12. For a real number g > 0, we say that a complementary pair {x,, x,} is yf GT g-bad in 
E if for any other {xj, xf\ in E either: 


Yf G .T (xf, x,- 

) = Y f G, T (.Xi > Xf) < g, 

(6.2) 

Yf G , T (xi,Xj 

> = YfG.T (Xi,Xj) < g, 

(6.3) 

YfG.T (, x i • Xj 

> = YfG.r(Xi,Xj) < g. 

(6.4) 


A complementary pair {x,-, x,-} is called yp GT g-good in E if it is not yf G T g-bad. 

The following is obvious: 

Proposition 4. The collection of yf G T g-good pairs in E is an DNAf G T (n, L, g) code. 

Lemma 1. Let x,y e DNA(n,L ) be randomly selected without replacement. Then there exists a |ins-C: begin| 

DNAf G T (n, L, g) code with N complementary pairs of strands from DNA(n, L ) if 

Pr (y(x, x) <g) + (2 N - 1) Pr (y(x, y) < g) < ±. (6.5) 

Proof. Let f g (x,y) be the probability that either y(x, y) < g or y(x, y) < g. Since Pr(y(x, y) < 
g ) = Pr(y(x, V) < g) , then 

P g (x,y)^2MY(.x,y)<g). ( 6 . 6 ) 

Let 

P g (x, x) = Pr (y(x, x)<g). (6.7) 

Let |E| = 2 N. From the additive bound of the probability of the union of events, it follows that the 
probability that {x,-, x,-} is y g-bad in E is at most 


p g (x,x) + (2N-l)p g (x,y). 


(6.8) 
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Thus the probability that (x,-, x,} is good in E is 


1 - (p g (x, x ) + (2 N - 1)/S g (x, y)). (6.9) 

The main point of all of this is that the expected number of good pairs in E is 

2N (1 - (P g (x, x) + (2 N - l)p g (x, y))). (6.10) 

and (6.10) will be at least N when 

P g {x,x) + {2N-\)P g {x,y)^ 1 -. (6.11) 

Thus given (6.11), there must exist an E that contains N yf GT g-good pairs in E. Hence the result follows 
from Proposition 4. ■ 

Proposition 5. Given f G j, suppose 0 f c f f G j for all x;x !+ i L. Let g > 0 and suppose that £ 
is not an integer. Then for distinct x, y e DNA(n, L) 

a. Pr(y/ Gr (x, y) < g) 4 ^([JJ ,n). (6.12) 

b. Pr(y fGT (x,x) < g) < /^([|j ,«). (6.13) 

Proof. Applying Proposition 3, it follows that 

Pr (Yfox (*. }’) <g)f Pr (U(x, y) 4 [* J). (6.14) 


The result follows from Corollary 2. 


Theorem 1. Let L = {AA,TT,AT,TA}. Given fcj, suppose 0 f c f f G j for all x,-x,-+i $£ L. Let 
g > 0 and suppose that ^ is not an integer. Let S\ ( N, n,g,c ) and 82(N, n,g,c ) be as given in Proposition 5. 

Define Bad/f N, n, g, c ) to be: 

Bad L (N,n, g, c) = F 2 ([|J ,n) + (2 N - l)Fi([|J .n). (6.15) 

If BadL(N,n, g, c ) < then there exists a DNAf GT (n, L, g) code with N complementary pairs. 

Proof. Apply Lemma 1 and Propositions 4 and 5. ■ |ins-C: end| 


7. RESULTS 

In the Examples 5-7 below, let L\ = {AA, TT, AT, TA} and L 2 = 0. The probabilities for the L 2 case 
can be obtained by using Proposition 2 parts a and c. Given f G j for /#, fs in Table 2, the appropriate 
value for c in Theorem 1 is c\ = 1.29 and c 2 = 0.60 respectively. As in Example 4, for a given N, which 
denotes the number of pairs in a desired code, let = 1 — jJv an d let g,v = g(a,v, N ) be given by (5.6) 
when T = 310°K. 

Example 5. In Figure 3, using the left y-axis, the blue and yellow graphs plot the points (t\ , log 10 N\) |F3 1 

and (?2,1 o g | q N 2 ) respectively that minimize even U for given Ni subject to the constraint that Bad Li 
(N , ti,gN,Cj) f \ as given in (6.15). The right y-axis gives the corresponding points (t,-, g,v,)- Thus, U is 
the theoretically computed sufficient length of DNA strands such that there is a DNAj G T (f,-, Li, g, ) code 
that self-assembles with fidelity a,- as defined in Definition 10. Note that U is the sufficient length of DNA 
strands such that there is a DNA j G T («,•, Li , gs/) code that self-assembles with an expected failure rate of 
one WC duplexes failing to form per all possible 2 N WC duplexes in the code (of strand multiplicity two.) 
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To take a specific instance, if N\ = 1000, then = 0.9995, so = 14.05. Thus, a sufficient strand 
length is t\ =33, because 

Bad Ll (1000,33, 14.05,1.29) = 0.42 

while 

Bad Ll ( 1000,32,14.05,1.29) = 1.02. 

However, ?2 = 54, because 

Bad Ll ( 1000, 54,14.05,1.29) = 0.23 




Bad Ll ( 1000,53,14.05,1.29) = 0.64. 

This difference depends on the difference between L\ and Li and is discussed in the Conclusion section 
below. 

Example 6. Given foj and L,, a# and gsi as in Example 5, the pink and light blue graphs in 
Figure 3 respectively plot the points (ei, log 10 Ai) and (e-i, ]og K) Nf) that minimize strand length e,- for 
given Ni subject to the constraint that 


as given in (6.11) and where 

p u (x,x) + (2N-m, g (x,y) 

was empirically estimated by performing 10 6 trials sampling from ZWA(e,, L,). Thus e,- empirically esti¬ 
mates the sufficient length of DNA strands such that there is DNAj gt (e,-, L it g^) code with N pairs that 
self-assembles with fidelity as defined in Definition 10. To take a specific instance, if N\ = 1000, 




1102 


BISHOP ET AL. 


then the empirically estimated sufficient strand length is e\ = 27 for strands that don’t contain 2-strings 
in {AA,TT, AT,TA}. Note that for N2 = 1000, the empirically estimated sufficient strand length is 
e\ = 28 for strands with no restrictions on their 2-strings. In general, the graphs indicate that codes taken 
from DNA(n, L\) have slightly better emperically estimated random coding bounds than those taken from 
DNA(n, Lf). 

Example 7. Given fo r and L\, a,v and g,v as in Example 5, the maroon graph in Figure 3 plots the 
points (s, logjQ N) that minimize strand length s for given N subject to the constraint that the DNA code 
software SynDCode (Bishop et al„ 2006) has produced a DNAf GT (s, L\, g^) code. 


8. CONCLUSION 

A new type of DNA code, called a DNAf GT (n,L,g ) free energy gap code has been defined and a 
statistical thermodynamic probabilistic model for DNAf GT (n, L, g) code self-assembly has been given. 
Theoretical and empirical random coding lower bounds for DNAf GT (n, L, g) codes were obtained and 
used to determine sufficient DNA code design parameters needed to achieve experimental goals. It has 
been noted that small changes in sequence composition, e.g., whether the consecutive pairs, AA, TT, 
AT or TA appear in any code sequence makes a difference in the potential fidelity of a DNA code, but 
perhaps not such as wide a difference in the theoretical sufficient strand length bounds between the cases 
when L\ = {AA, TT,AT , TA} and L2 = 0 as is shown in Example 5. By observing the empirical results 
in Examples 6 and 7, it is clear the theoretical bound for the case L2 = 0 exhibited in Figure 3 by 
the yellow graph is poor in comparison to that given for the case L\ = {AA, TT,AT, TA} exhibited in 
Figure 3 by the blue graph. This seems to be because that in the L2 = 0 case, the c = 0.60 for Thereom 
1 is far further from the average value 1.42 of foj over the stack pairs not in L2 = 0 than is the 
c = 1.29 in the L\ = {AA, TT,AT, TA} case from average value 1.59 of fcj over the stack pairs not in 
Li = {AA, TT,AT, TA}. 

Earlier work on free energy gap collections of oligos is summarized in Tulpan et al. (2005). The results 
presented here suggest that the free energy gaps for collections that are given there may be too small. As 
discussed in Example 5, the fidelity a of a code with N pairs should be greater than 1 — ^ if that code 
is to self-assemble with an expected failure rate of less than one WC duplex. Thus, if N > 64, it follows 
from (5.6) that the free energy gap must be greater than 8.95. However, none of the 19 collections given 
in Table 2 of Tulpan et al. (2005) with N > 64 have a free energy gap greater that 8.95. Moreover, the 
SynDCode data exhibited in Figure 3, indicates that for strands of length 16, codes larger than those listed 
in Tulpan et al. (2005) may be found. Other sequence constraints for collections of oligos are considered 
in Tulpan et al. (2005). SynDCode (Bishop et al., 2006) also allows for consideration for many of these 
same constraints. For example, the code S8-2 given in Tulpan et al. (2005) is nearly a DNAf GT (16, L, 7.85) 
code where L = {GGG, CCC}. It is not exactly such a code for several reasons. The two most significant 
reasons are as follows: 

1. S8-2 doesn’t constrain the what is called the word-word crosshybridization potential (because an un¬ 
derlying in assumption (Tulpan et al., 2005) is that the strands x are fixed to a surface.) 

2. S8-2 uses Pairfold (Andronescu et al., 2003) to measure the free energy gap while DNAj GT (16, L, 7.85) 
uses y faT . 

As was noted earlier, yf GT is more restrictive that Pairfold (see Example 3). Some of the most 
important additional constraints for S 8 — 2 are: 

For x in 58 — 2 that: 

a. CC is not contained at the start or end of x and therefore GG is not contained at the start or end 
of x. 

b. G is not contained in x and therefore C is not contained in x. 

c. 15.45 < ||x, x|| fGT — IP < 16.42 where we are assuming IP = 1.96. 

SynDCode can generate DNAf GT (n, L, g) with each of these additional constrains. In particular, in 
Figure 4, an example of a DNAj G T (16, {GGG, CCC}, 7.85) code is provided that satisfies conditions a-c. 
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Code Strands 

TCCTAAACCATTCTTA 

Si 

TAAGAATGGTTTAGGA 

Ci 


CATCAAAAAACT CAAT 

S50 

ATTGAGTTTTTTGATG 


ACACATACTACTTCAA 

s 2 

TTGAAGTAGTATGTGT 

c 2 

AAAACACTTCTTTACT 

S51 

AGTAAAGAAGTGTTTT 


TTTCATTTCCAAAAAC 

s 3 

GTTTTTGGAAATGAAA 

c 3 

TTTTTACCTTAACCTC 


GAGGTTAAGGTAAAAA 


TAATAAACACACTCCA 

s 4 

TGGAGTGTGTTTATTA 

c 4 

CAATCTCTCTTCATAC 

S 53 

GTATGAAGAGAGATTG 

C53 

CACTCTCACATTAAAA 

s 5 

TTTTAATGTGAGAGTG 

c 5 

ACACTAAAAAAACAAC 


GTTGTTTTTTTAGTGT 


ACATCTTCCTATACAA 

S 6 

TTGTATAGGAAGATGT 

Cs 

ATCATAACACAAACTT 

S55 

AAGTTTGTGTTATGAT 

cs. 

AT CCAT CAAACATAAA 


TTTATGTTTGATGGAT 


CACACATCATTTTTTC 


GAAAAAATGATGTGTG 


TTTCTCAATCCTACTA 


TAGTAGGATTGAGAAA 


CTATCCACCTAAAAAA 

S57 



TTATACACCTCACTTT 

s. 

AAAGTGAGGTGTATAA 

Cs 

TATATCCTCTCCAATC 

Sss 

GATTGGAGAGGATATA 

Css 

AAAATTCATCAACCAA 


TTGGTTGATGAATTTT 

C10 

CTACATATCTAACCAC 

S59 

GTGGTTAGATATGTAG 

C59 

CAACCACTTAATCTTC 

Sii 

GAAGATTAAGTGGTTG 

C11 

CTTACTACCAAACTTC 


GAAGTTTGGTAGTAAG 

C60 

ACCTTTTCTAACTCAT 

Sl2 

ATGAGTTAGAAAAGGT 

C12 

AAAAAATCCACATTTC 

Sei 

GAAATGTGGATTTTTT 

C 6 i 

ACTTATAAACCTACCA 

Sl3 

TGGTAGGTTTATAAGT 

C13 

AATTCTCCACTTTAAC 

S62 

GTTAAAGTGGAGAATT 

C 6 2 

ACTCACCTCAATATAC 

S14 

GTATATTGAGGTGAGT 


AATTTCAACTATCACA 

s 63 

TGTGATAGTTGAAATT 

C 6 3 

TCCATATTCATACCTC 

Sis 

GAGGTATGAATATGGA 

Cl5 

CAAAAAAAATCTCCTC 

s 64 

GAGGAGATTTTTTTTG 

C 64 

CACAATTTTTACAACT 

Sl6 

AGTTGTAAAAATTGTG 

Cl6 

TCACTTTTACCATTAC 

S65 

GTAATGGTAAAAGTGA 

Ces 

ACTCTTTCTTTTCCTT 

S17 

AAGGAAAAGAAAGAGT 


TACTCCAACATTTTAC 

S 66 

GTAAAATGTTGGAGTA 

C66 

TCTCATTAACATACAC 

Sis 

GTGTATGTTAATGAGA 

Cl8 

AAAACACCACAAATAA 

S 6 7 

TTATTTGTGGTGTTTT 

C 6 7 

CATT CT TACTCTCT CT 

S19 

AGAGAGAGTAAGAATG 

Cl, 

TAACTTTCTATCTCCA 

S 68 

TGGAGATAGAAAGTTA 


CTCTCTCAACTTAATC 


GATTAAGTTGAGAGAG 



CTCCTCTTCTCTATTA 


TAATAGAGAAGAGGAG 

Cs, 

AAACTAACATCAACAT 

S 2 1 

ATGTTGATGTTAGTTT 


ACCTAACAATACAATC 


GATTGTATTGTTAGGT 

C,o 


S 22 



TCCTAAAATATCACAC 

S71 

GTGTGATATTTTAGGA 

C7I 

CTCTAACTACTACCTT 

S 23 

AAGGTAGTAGTTAGAG 


ACCAAAATACTCTTAC 

S72 

GTAAGAGTATTTTGGT 

C72 

CTTCATCATTTCTCTT 

S24 

AAGAGAAATGATGAAG 

C24 

ACAACTTCTCAATTTT 

S73 

AAAATTGAGAAGTTGT 

C73 

ATTAACTCCTCCTAAA 


TTTAGGAGGAGTTAAT 


AACATTACCTAATCAC 

S74 


C74 

CAACATACAATTACCA 


TGGTAATTGTATGTTG 

C26 

ATAACTCATCTTTCAC 


GTGAAAGATGAGTTAT 

C,s 

CTCTTACAAAACATCT, 

S27 

AGATGTTTTGTAAGAG. 

C27 

TACACAACATAT CCTA 


TAGGATATGTTGTGTA 

C76 

CAAAAACCAACTTTAA, 

s 29 

TTAAAGTTGGTTTTTG 

C28 

CTTCTACATCTCAAAA 


TTTTGAGATGTAGAAG 

c” 

TCTCTAATCACCAATA, 


TATTGGTGATTAGAGA. 

C30 

CTAAATACCATCCAAC 


GTTGGATGGTATTTAG 

C„ 

ATATTACCACATCCTT, 


AAGGATGTGGTAATAT. 


ACTATCCAATTAACAC 


GTGTTAATTGGATAGT 


CATTTCACTCAAATTT, 


AAATTTGAGTGAAATG 

C32 

TTTTTTATCTCACACA 


TGTGTGAGATAAAAAA 

C81 

CTTTACTTCCACAATA, 

Sis 

TATTGTGGAAGTAAAG. 

C33 

CTACTACACTACACAT 


ATGTGTAGTGTAGTAG 

C 82 

CACATATCCAAATCTC, 




ATTTTCACATTCTACA 

S83 


Ce3 

CACCATAAATCATCAT, 


ATGATGATTTATGGTG. 


CAATATTCTTACACCA 


TGGTGTAAGAATATTG 

Cs4 

TTCACAAACCTTAATT, 

S36 

AATTAAGGTTTGTGAA 

C36 

TCATACTTTATCCTCA 

S 8 5 

TGAGGATAAAGTATGA 

C 8 5 

ACCTCTTTATTCAAAC, 


GTTTGAATAAAGAGGT. 

C37 

TTTCCAATACATCAAT 



ATTGATGTATTGGAAA 

Ce6 

TCCTTACACAAATAAC, 


GTTATTTGTGTAAGGA. 

C38 

AAATCAAATACCTCTC 



GAGAGGTATTTGATTT 

C S 7 

ACAAATCACTTAAACA, 


TGTTTAAGTGATTTGT. 

C39 

TCCATTCCTTCTAATA 



TATTAGAAGGAATGGA 

c 88 

CTCCTTCAATTTTCAA, 



C40 

TCCTTATTACTCCTAC 




Cs9 

ATCAACACTATTCATC, 

S41 

GATGAATAGTGTTGAT 

C41 

CTTCTCTTATTTCACA 



TGTGAAATAAGAGAAG 

C90 

TTCTTTCCTCATTTTT, 

S42 

AAAAATGAGGAAAGAA 

C42 

TACTTCCTTTTTCTTC 

S91 

GAAGAAAAAGGAAGTA 

C91 

TATCACTCATACATCT, 

S43 

AGATGTATGAGTGATA. 

C43 

ACCACACACTATATAT 

S92 

ATATATAGTGTGTGGT 

C,2 

CAACCATATCAAAAAC, 


GTTTTTGATATGGTTG. 


TACTTCACTAATTCCT 


AGGAATTAGTGAAGTA 


AATATCCTTTCTCACT, 

S 45 

AGTGAGAAAGGATATT 

C45 

CTTCAACAACTAACTA 


TAGTTAGTTGTTGAAG 

C94 

AAACTATTTCCACTCT, 


AGAGTGGAAATAGTTT. 


ACACCTATTATTCTCT 


AGAGAATAATAGGTGT 

C95 

TTTTTCATTCACCTTT, 

S47 

AAAGGTGAATGAAAAA 

C47 

ACAATCAATTCTACTC 

S96 

GAGTAGAATTGATTGT 

C96 

ATAACCAACCTACATT, 

Sis 

AATGTAGGTTGGTTAT. 

C48 

CTATCAAACTCCTTTT 


AAAAGGAGTTTGATAG 

C97 

ACCATTTTTATTCCAT, 


ATGGAATAAAAATGGT. 


1 

1 


FIG. 4. DNAj G T (16, {GGG, CCC}, 7.85) code with additional constraints. 


Thus considering that DNAf GT (n, L, g) codes do not ignore word-word crosshybrization potential and 
use a more restrictive measure free energy gap, the code given in Figure 4 is much more restrictive than 
58 — 2. The number of strands in 58 — 2 is 80 while the number of pairs of strands in the exhibited 
DNAf GT {\6, {GGG, CCC}, 7.85) is N = 97. It should be noted that there are other bonding specificity 
constraints considered in 58 — 2 that are not considered in the code given in Figure 4 and, in light of the 
random coding bound data t\ and e\ in Figure 3, 58 — 2 is a good design. 
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